1 Qualitative Predictors

The simplest situation where dummy variables might be used in a regression model is when the qualitative predictor has only two levels. The regression model for a single quantitative predictor \((x_1)\) and a dummy variable \((D_1)\) is written

\[\begin{equation} Y = \beta_0 + \beta_1 x_1 + \beta_2 D_1 + \beta_3 x_1 D_1 + \varepsilon \tag{1.1} \end{equation}\] where \[\begin{equation*} D_1 = \begin{cases} 0 &\text{for the first level}\\ 1 &\text{for the second level.} \end{cases} \end{equation*}\]

The model in (1.1) when \(D_1\) has two levels will yield one of four possible scenarios, as shown in Figure 1.1. This type of model requires the user to answer three basic questions:

Four possible results for a single dummy variable with two levels.  Graph I has the intercept and the slope the same for both levels of the dummy variable.  Graph II has the two lines with the same slope, but different intercepts.  Graph III shows the two fitted lines with the same intercept but different slopes.  Graph IV shows the two lines with different intercepts and different slopes.

Figure 1.1: Four possible results for a single dummy variable with two levels. Graph I has the intercept and the slope the same for both levels of the dummy variable. Graph II has the two lines with the same slope, but different intercepts. Graph III shows the two fitted lines with the same intercept but different slopes. Graph IV shows the two lines with different intercepts and different slopes.


To address whether the lines are the same, the null hypothesis \(H_0: \beta_2 = \beta_3 = 0\) must be tested. One way to perform the test is to use the general linear test statistic based on the full model found in (1.1) and the reduced model \(Y = \beta_0 + \beta_1 x_1 + \varepsilon\). If the null hypothesis is not rejected, the interpretation is that there is one line present (the intercept and the slope are the same for both levels of the dummy variable). This is the case for graph I of Figure 1.1. If the null hypothesis is rejected, either the slopes, the intercepts, or possibly both the slope and the intercept are different for the different levels of the dummy variable, as seen in graphs II, III, and IV of Figure 1.1, respectively.

To answer whether the slopes are the same, the null hypothesis \(H_0: \beta_3 = 0\) must be tested. If the null hypothesis is not rejected, the two lines have the same slope, but different intercepts, as shown in graph II of Figure 1.1. The two parallel lines that result when \(\beta_3 = 0\) are

\[\begin{equation*} Y = \beta_0 + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 0) \quad\text{and}\quad Y = (\beta_0 +\beta_2) + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 1). \end{equation*}\]

When \(H_0: \beta_3=0\) is rejected, one concludes that the two fitted lines are not parallel as in graphs III and IV of Figure 1.1.

To answer whether the intercepts are the same, the null hypothesis \(H_0: \beta_2 = 0\) for model (1.1) must be tested. The reduced model for this test is \(Y = \beta_0 + \beta_1 x_1 + \beta_3 x_1 D_1 + \varepsilon\). If the null hypothesis is not rejected, the two fitted lines have the same intercept but different slopes:

\[\begin{equation*} Y = \beta_0 + \beta_1 x_1 + \varepsilon \text{ for } (D_1 = 0) \quad\text{and}\quad Y = \beta_0 + (\beta_1 +\beta_3) x_1 + \varepsilon \text{ for } (D_1 = 1). \end{equation*}\]

Graph III of Figure 1.1 represents this situation. If the null hypothesis is rejected, one concludes that the two lines have different intercepts, as in graphs II and IV of Figure 1.1.

2 Example

Suppose a realtor wants to model the appraised price of an apartment as a function of the predictors living area (in m\(^2\)) and the presence or absence of elevators. Consider the data frameVIT2005, which contains data about apartments in Vitoria, Spain, including totalprice, area, and elevator, which are the appraised apartment value in Euros, living space in square meters, and the absence or presence of at least one elevator in the building, respectively. The realtor first wants to know if there is any relationship between appraised price \((Y)\) and living area \((x_1)\). Next, the realtor wants to know how adding a dummy variable for whether or not an elevator is present changes the relationship: Are the lines the same? Are the slopes the same? Are the intercepts the same?

2.1 Solution (is there a realationship between totalprice and area?):

library(tidyverse)
library(PASWR2)
VIT2005 <- VIT2005 %>% 
  mutate(elevator = factor(elevator, labels = c("No", "Yes")))
mod_simple <- lm(totalprice ~ area, data = VIT2005)
summary(mod_simple)

Call:
lm(formula = totalprice ~ area, data = VIT2005)

Residuals:
    Min      1Q  Median      3Q     Max 
-156126  -21564   -2155   19493  120674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40822.4    12170.1   3.354  0.00094 ***
area          2704.8      133.6  20.243  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40810 on 216 degrees of freedom
Multiple R-squared:  0.6548,    Adjusted R-squared:  0.6532 
F-statistic: 409.8 on 1 and 216 DF,  p-value: < 2.2e-16
library(moderndive)
get_regression_table(mod_simple)
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept   40822.    12170.      3.35   0.001   16835.   64810.
2 area         2705.      134.     20.2    0        2441.    2968.
ggplot(data = VIT2005, aes(x = area, y = totalprice)) + 
  geom_point() + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE)
Scatterplot of `totalprice` versus `area` with the fitted regression line superimposed from `mod_simple`

Figure 2.1: Scatterplot of totalprice versus area with the fitted regression line superimposed from mod_simple

A linear regression model of the form

\[\begin{equation} Y= \beta_0 +\beta_1 x_1 + \varepsilon \tag{2.1} \end{equation}\]

is fit yielding

\[\begin{equation*} \widehat{Y}_i = 4.0822416\times 10^{4} + 2704.7510279 x_{i1} \end{equation*}\]

and a scatterplot of totalprice versus area with the fitted regression line superimposed over the scatterplot is shown in Figure 2.1.

Based on Figure 2.1, there appears to be a linear relationship between appraised price and living area. Further, this relationship is statistically significant, as the p-value for testing \(H_0: \beta_1=0\) versus \(H_1: \beta_1 \neq 0\) is less than \(2 \times 10^{-16}\).

2.2 Solution (does adding a dummy variable (elevator) change the relationship?):

The regression model including the dummy variable for elevator is written \[\begin{equation} Y = \beta_0 + \beta_1 x_1 + \beta_2 D_1 + \beta_3 x_1 D_1 + \varepsilon \tag{2.2} \end{equation}\] where \[\begin{equation*} D_1 = \begin{cases} 0 &\text{when a building has no elevators}\\ 1 &\text{when a building has at least one elevator.} \end{cases} \end{equation*}\]

To determine if the lines are the same (which means that the linear relationship between appraised price and living area is the same for apartments with and without elevators), the hypotheses are

\[\begin{equation*} H_0: \beta_2 = \beta_3 = 0 \text{ versus } H_1: \text{at least one } \beta_i \neq 0 \text{ for } i=2, 3. \end{equation*}\]

mod_full <- lm(totalprice ~ area + elevator + area:elevator, data = VIT2005)
anova(mod_simple, mod_full)  # compare models
Analysis of Variance Table

Model 1: totalprice ~ area
Model 2: totalprice ~ area + elevator + area:elevator
  Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
1    216 3.5970e+11                                  
2    214 3.0267e+11  2 5.704e+10 20.165 9.478e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this problem, one may conclude that at least one of \(\beta_2\) and \(\beta_3\) is not zero since the p-value = \(9.4780144\times 10^{-9}\). In other words, the lines have either different intercepts, different slopes, or different intercepts and slopes.

To see if the lines have the same slopes (which means that the presence of an elevator adds constant value over all possible living areas), the hypotheses are \[\begin{equation*} H_0: \beta_3 = 0 \text{ versus } H_1: \beta_3 \neq 0. \end{equation*}\]

anova(mod_full)
Analysis of Variance Table

Response: totalprice
               Df     Sum Sq    Mean Sq  F value    Pr(>F)    
area            1 6.8239e+11 6.8239e+11 482.4846 < 2.2e-16 ***
elevator        1 4.5308e+10 4.5308e+10  32.0352  4.83e-08 ***
area:elevator   1 1.1732e+10 1.1732e+10   8.2949   0.00438 ** 
Residuals     214 3.0267e+11 1.4143e+09                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-value = \(0.0043797\), it may be concluded that \(\beta_3 \neq 0\), which implies that the lines are not parallel.

To test for equal intercepts (which means that appraised price with and without elevators starts at the same value), the hypotheses to be evaluated are \[\begin{equation*} H_0: \beta_2 = 0 \text{ versus }H_1: \beta_2 \neq 0. \end{equation*}\]

mod_full <- lm(totalprice ~ area + elevator + area:elevator, data = VIT2005)
mod_inter <- lm(totalprice ~ area + area:elevator, data = VIT2005)
anova(mod_inter, mod_full)  # compare models
Analysis of Variance Table

Model 1: totalprice ~ area + area:elevator
Model 2: totalprice ~ area + elevator + area:elevator
  Res.Df        RSS Df  Sum of Sq      F Pr(>F)
1    215 3.0624e+11                            
2    214 3.0267e+11  1 3576497188 2.5288 0.1133

Since the p-value for testing the null hypothesis is 0.1132635, one fails to reject \(H_0\) and should conclude that the two lines have the same intercept but different slopes.

summary(mod_inter)

Call:
lm(formula = totalprice ~ area + area:elevator, data = VIT2005)

Residuals:
    Min      1Q  Median      3Q     Max 
-125093  -21762   -2201   18117  112252 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71352.08   12309.18   5.797 2.39e-08 ***
area              1897.94     180.59  10.510  < 2e-16 ***
area:elevatorYes   553.99      90.42   6.127 4.23e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37740 on 215 degrees of freedom
Multiple R-squared:  0.7061,    Adjusted R-squared:  0.7034 
F-statistic: 258.3 on 2 and 215 DF,  p-value: < 2.2e-16
coef(mod_inter)
     (Intercept)             area area:elevatorYes 
      71352.0844        1897.9368         553.9856 
b0 <- coef(mod_inter)[1]
b1NO <- coef(mod_inter)[2]
b1YES <- coef(mod_inter)[2] + coef(mod_inter)[3]
c(b0, b1NO, b1YES)
(Intercept)        area        area 
  71352.084    1897.937    2451.922 

The fitted model is \(\widehat{Y}_i = 7.1352084\times 10^{4} + 1897.9368262x_{i1} + 553.9856453x_{i1}D_{i1}\), and the fitted regression lines for the two values of \(D_1\) are shown in Figure 2.2. The fitted model using the same intercept with different slopes has an \(R^2_a\) of 0.7033949, a modest improvement over the model without the variable elevator, which had an \(R^2_a\) value of 0.6532269.

ggplot(data = VIT2005, aes(x = area, y = totalprice, color = elevator)) + 
  geom_point(alpha = 0.5) + 
  theme_bw() + 
  geom_abline(intercept = b0, slope = b1NO, color = "red") + 
  geom_abline(intercept = b0, slope = b1YES, color = "blue") + 
  scale_color_manual(values = c("red", "blue")) + 
  xlim(10, 200) + 
  ylim(50000, 500000) + 
  labs(x = "Living Area is Square Meters",
       y = "Appraised Price in Euros")
Fitted regression lines for `mod_inter`

Figure 2.2: Fitted regression lines for mod_inter


3 Diagnostics

MDF <- get_regression_points(mod_inter)
ggplot(data = MDF, aes(x = totalprice_hat, y = residual)) +
  geom_point() +
  theme_bw() +
  labs(title = "Residuals versus Fitted Values") +
  geom_hline(yintercept = 0, lty = "dashed")

ggplot(data = MDF, aes(x = residual)) +
  geom_histogram(fill = "lightblue", color = "blue") +
  theme_bw() 

ggplot(data = MDF, aes(sample = residual)) +
  geom_qq() +
  geom_qq_line() +
  theme_bw() 


4 Example

Consider the MA_schools data frame from the moderndive package which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education. Consider a model with SAT math scores (average_sat_math) modeled as a function of percentage of the high school’s student body that are economically disadvantaged (perc_disadvan) and the a categorical variable measuring school size (size).

4.1 Solution (is there a relationship between average_sat_math and perc_disadvan?):

ggplot(data = MA_schools, aes(x = perc_disadvan, y = average_sat_math)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(method = "lm", se = FALSE)

mod_simple <- lm(average_sat_math ~ perc_disadvan, data = MA_schools)
summary(mod_simple)

Call:
lm(formula = average_sat_math ~ perc_disadvan, data = MA_schools)

Residuals:
   Min     1Q Median     3Q    Max 
-80.74 -21.26  -4.12  18.54 174.17 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   581.2811     3.2668   177.9   <2e-16 ***
perc_disadvan  -2.7798     0.1011   -27.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.54 on 330 degrees of freedom
Multiple R-squared:  0.6962,    Adjusted R-squared:  0.6953 
F-statistic: 756.2 on 1 and 330 DF,  p-value: < 2.2e-16
get_regression_table(mod_simple)
# A tibble: 2 × 7
  term          estimate std_error statistic p_value lower_ci upper_ci
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept       581.       3.27      178.        0   575.     588.  
2 perc_disadvan    -2.78     0.101     -27.5       0    -2.98    -2.58

You complete the rest….


4.2 Solution (does adding a dummy variable size change the relationship?):

ggplot(data = MA_schools, aes(x = perc_disadvan, y = average_sat_math, color = size)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(method = "lm", se = FALSE) -> p1
ggplot(data = MA_schools, aes(x = perc_disadvan, y = average_sat_math, color = size)) + 
  geom_point() +
  theme_bw() + 
  geom_parallel_slopes(se = FALSE) -> p2
library(patchwork)
p1 + p2

mod_full <- lm(lm(average_sat_math ~ perc_disadvan + size + perc_disadvan:size, data = MA_schools))
anova(mod_simple, mod_full)
Analysis of Variance Table

Model 1: average_sat_math ~ perc_disadvan
Model 2: average_sat_math ~ perc_disadvan + size + perc_disadvan:size
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    330 371191                           
2    326 367669  4    3521.5 0.7806 0.5384
anova(mod_full)
Analysis of Variance Table

Response: average_sat_math
                    Df Sum Sq Mean Sq  F value Pr(>F)    
perc_disadvan        1 850615  850615 754.2112 <2e-16 ***
size                 2   3133    1566   1.3888 0.2508    
perc_disadvan:size   2    389     194   0.1724 0.8417    
Residuals          326 367669    1128                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5 Simpson’s Paradox

library(ISLR)
credit_paradox <- Credit %>% 
  select(ID, debt = Balance, credit_limit = Limit, 
         credit_rating = Rating, income = Income, age = Age)
ggplot(data = credit_paradox, aes(x = credit_limit, y = debt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() -> p1
ggplot(data = credit_paradox, aes(x = income, y = debt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() -> p2
library(patchwork)
p1 + p2

library(plotly)
p <- plot_ly(data = credit_paradox, z = ~debt, x = ~credit_limit, y = ~income) %>% 
  add_markers()
p
mod <- lm(debt ~ credit_limit + income, data = credit_paradox)
summary(mod)$coef
                 Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)  -385.1792604 19.464801525 -19.78850  3.878764e-61
credit_limit    0.2643216  0.005879729  44.95471 7.717386e-158
income         -7.6633230  0.385072058 -19.90101  1.260933e-61
x <- seq(0, 14000, length = 70)
y <- seq(0, 150, length = 70) 
plane <- outer(x, y, function(a, b){summary(mod)$coef[1,1] + summary(mod)$coef[2, 1]*a + summary(mod)$coef[3,1]*b})
# draw the plane
p %>% 
  add_surface(x = ~x, y = ~y, z = ~plane)
qs <- quantile(credit_paradox$credit_limit, probs = seq(0, 1, .25))
# credit_paradox$credit_cats <- cut(credit_paradox$credit_limit, breaks = qs, include.lowest = TRUE)
############### Or above
credit_paradox <- credit_paradox %>% 
  mutate(credit_cats = cut(credit_limit, breaks = qs, include.lowest = TRUE))
head(credit_paradox)
  ID debt credit_limit credit_rating  income age         credit_cats
1  1  333         3606           283  14.891  34 (3.09e+03,4.62e+03]
2  2  903         6645           483 106.025  82 (5.87e+03,1.39e+04]
3  3  580         7075           514 104.593  71 (5.87e+03,1.39e+04]
4  4  964         9504           681 148.924  36 (5.87e+03,1.39e+04]
5  5  331         4897           357  55.882  68 (4.62e+03,5.87e+03]
6  6 1151         8047           569  80.180  77 (5.87e+03,1.39e+04]
ggplot(data = credit_paradox, aes(x = credit_limit)) +
  geom_density(fill = "pink", color = "black") + 
  geom_vline(xintercept = qs, color = "blue") + 
  theme_bw()

credit_paradox %>% 
  group_by(credit_cats) %>% 
  summarize(n())
# A tibble: 4 × 2
  credit_cats         `n()`
  <fct>               <int>
1 [855,3.09e+03]        100
2 (3.09e+03,4.62e+03]   100
3 (4.62e+03,5.87e+03]   100
4 (5.87e+03,1.39e+04]   100
p1 <- ggplot(data = credit_paradox, aes(x = income, y = debt)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(y = "Credit card debt (in $)",
       x = "Income (in $1000)")
p2 <- ggplot(data = credit_paradox, aes(x = income, y = debt, color = credit_cats)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw() + 
  labs(y = "Credit card debt (in $)",
       x = "Income (in $1000)",
       color = "Credit limit bracket")
p1 + p2
Relationship between credit card debt and income by credit limit bracket

Figure 5.1: Relationship between credit card debt and income by credit limit bracket